We're going to build and compare a few malware machine learning models in this series of Jupyter notebooks. Some of them require a GPU. I've used a Titan X GPU for this exercise. If yours isn't as beefy, you may get TensorFlow memory errors that may require modifying some of the code, namely file_chunks and file_chunk_size. (I'll point to them later.) But, to get started, the first few exercises will work even on that GPU you're embarrassed to tell people about, or, if you're willing to wait, with no GPU at all.
For the fancy folks who have multiple GPUs, we're going to restrict usage to the first one.
In [1]:
# limit GPU usage, if any, to this GPU
%env CUDA_VISIBLE_DEVICES=0
Also note that this exercise assumes you've already populated a malicious/ and a benign/ directory with samples that you consider malicious and benign, respectively. How many samples? In this notebook, I'm using 50K of each for demonstration purposes. Sadly, you must bring your own. If you don't populate these subdirectories with binaries (each renamed to the sha256 hash of its contents!), the code will bicker and complain incessantly.
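If it helps, here's a minimal sketch of one way to rename a directory of samples to the sha256 of their contents. Only the malicious/ and benign/ directory names come from this notebook; the helper itself is just an illustration.

import hashlib
import os

def rename_to_sha256(directory):
    # rename every file in `directory` to the sha256 hex digest of its contents
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        os.rename(path, os.path.join(directory, digest))

for d in ('malicious', 'benign'):
    rename_to_sha256(d)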
There is a lot of domain knowledge on what malware authors can do, and what malware authors actually do when crafting malicious files. Furthermore, there are some things malware authors seldom do that would indicate that a file is benign. For each file we want to analyze, we're going to encapsulate that domain knowledge about malicious and benign files in a single feature vector. See the source code at classifier/pefeatures.py.
Note that the feature extraction we use here contains many elements from published malware classification papers. Some of those are slightly modified. And there are additional features in this particular feature extraction that are included because, well, they were just sitting there in the LIEF parser patiently waiting for a chair at the feature vector table. Read: there's really no secret sauce in there, and to turn this into something commercially viable would take a bit of work. But, be my guest.
A note about LIEF. What a cool tool with a great mission! It aims to parse and manipulate binary files for Windows (PE), Linux (ELF) and macOS (Mach-O). Of course, we're using only the PE subset here. At the time of this writing, LIEF is still very much a new tool, and I've worked with the authors to help resolve some kinks. It's a growing project with more warts to find and fix. Nevertheless, we're using it as the backbone for features that require parsing a PE file.
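To give a flavor of the raw material LIEF exposes (and that classifier/pefeatures.py builds on), here's a tiny sketch; the file path is just a placeholder, and attribute names can shift slightly between LIEF versions.

import lief

binary = lief.parse('benign/<some_sha256>')   # placeholder path: any PE file you have on hand
print(binary.header.numberof_sections)        # COFF header: number of sections
print(binary.optional_header.sizeof_code)     # optional header: size of code
print(binary.entrypoint)                      # entry point address
for section in binary.sections:               # per-section name, size and entropy
    print(section.name, section.size, section.entropy)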
In [2]:
from classifier import common
In [3]:
# this will take a LONG time the first time you run it (and cache features to disk for next time)
# it's also chatty. Parts of feature extraction require LIEF, and LIEF is quite chatty.
# the output you see below is *after* I've already run feature extraction, so that
# X, y, and sha256list are being read from cache on disk
X, y, sha256list = common.extract_features_and_persist()
# split our features, labels and hashes into training and test sets
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(123)
X_train, X_test, y_train, y_test, sha256_train, sha256_test = train_test_split( X, y, sha256list, test_size=1000)
# a random train_test split, but for a malware classifier, we should really be holding out *future* malicious and benign
# samples, to better capture how we'll generalize to malware yet to be seen in the wild ...an exercise left to the reader
# (one way to do that is sketched just below)
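For the reader who takes that exercise on, a temporal holdout might look something like the sketch below. This is purely illustrative: it assumes you have a first_seen timestamp for each sample (aligned with the rows of X), which this notebook doesn't actually collect.

import numpy as np

# hypothetical: first_seen is an array of np.datetime64 timestamps, one per row of X
cutoff = np.percentile(first_seen.astype('int64'), 90)   # hold out the most recent ~10% of samples
train_mask = first_seen.astype('int64') <= cutoff
X_train, X_test = X[train_mask], X[~train_mask]
y_train, y_test = y[train_mask], y[~train_mask]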
We'll use the features we extracted to train a multilayer perceptron (MLP). An MLP is an artificial neural network with at least one hidden layer. Is a multilayer perceptron "deep learning"? Well, it's a matter of semantics, but "deep learning" may imply that the features and model are optimized together, end-to-end. So, in that sense, no: since we're using domain knowledge to extract features, then passing them to an artificial neural network, we'll remain conservative and call this an MLP. (As we'll see, don't get fooled just because we're not calling this "deep learning": this MLP is no slouch.) The network architecture is defined in classifier/simple_multilayer.py.
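For the curious, the heart of such a model in Keras is just a stack of Dense and Dropout layers ending in a sigmoid. The sketch below is roughly in the spirit of classifier/simple_multilayer.py, but the real file may differ in its choice of activations, normalization and optimizer.

from keras.models import Sequential
from keras.layers import Dense, Dropout

def create_model(input_shape, input_dropout=0.05, hidden_dropout=0.1,
                 hidden_layers=(4096, 2048, 1024, 512)):
    model = Sequential()
    model.add(Dropout(input_dropout, input_shape=input_shape))  # dropout directly on the input features
    for width in hidden_layers:
        model.add(Dense(width, activation='relu'))
        model.add(Dropout(hidden_dropout))
    model.add(Dense(1, activation='sigmoid'))                   # outputs P(malicious)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model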
In [4]:
# standard-scaling the data can be important for a multilayer perceptron
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
# Note that we're using scaling info from X_train to transform both the training and test sets
X_train = scaler.transform(X_train) # scale for multilayer perceptron
X_test = scaler.transform(X_test)
from classifier import simple_multilayer
from keras.callbacks import LearningRateScheduler, EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
model = simple_multilayer.create_model(
input_shape=(X_train.shape[1], ), # input dimensions
input_dropout=0.05, # this prevents the model from becoming a fanboy of (overfitting to) any particular input feature
hidden_dropout=0.1, # same, but for hidden units. Dropping out hidden units can create a sort of ensemble learner
hidden_layers=[4096, 2048, 1024, 512] # this is "art": choosing the number of hidden layers and the width of each. Don't be afraid to change this
)
model.fit(X_train, y_train,
batch_size=128,
epochs=200,
verbose=1,
callbacks=[
EarlyStopping( patience=20 ),
ModelCheckpoint( 'multilayer.h5', save_best_only=True),
ReduceLROnPlateau( patience=5, verbose=1)],
validation_data=(X_test, y_test))
from keras.models import load_model
# we'll load the "best" model (in this case, the penultimate model) that was saved
# by our ModelCheckPoint callback
model = load_model('multilayer.h5')
y_pred = model.predict(X_test)
common.summarize_performance(y_pred, y_test, "Multilayer perceptron")
# The astute reader will note we should be doing this on a separate holdout, since we've explicitly
# saved the model that works best on X_test, y_test ...an exercise left for the reader...
Out[4]:
Alright. Is that good? Let's compare to another model. We'll reach for the simple and reliable random forest classifier.
One nice thing about tree-based classifiers like a random forest classifier is that they are invariant to linear scaling and shifting of the dataset (the model will automatically learn those transformations). Nevertheless, for a sanity check, we're going to use the scaled/transformed features in a random forest classifier.
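If you want to convince yourself of that invariance, one quick (and entirely optional) check is to fit two identically-seeded forests on the original data and on a linearly transformed copy, then compare their predictions:

from sklearn.ensemble import RandomForestClassifier
import numpy as np

rf_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
rf_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(3.0 * X_train + 7.0, y_train)
# identical seeds + a monotone linear transform => the same trees get learned, so predictions (nearly) coincide
print(np.abs(rf_a.predict_proba(X_test) - rf_b.predict_proba(3.0 * X_test + 7.0)).max())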
In [5]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# you can increase performance by increasing n_estimators and removing the restriction on max_depth
# I've kept those in there because I want a quick-and-dirty look at how the MLP above compares
rf = RandomForestClassifier(
n_estimators=40,
n_jobs=-1,
max_depth=30
).fit(X_train, y_train)
y_pred = rf.predict_proba(X_test)[:,-1] # get probability of malicious (last class == last column)
_ = common.summarize_performance(y_pred, y_test, "RF Classifier")
Really, it's not a terrible model, but it's nothing special. We'd really like to get to the realm of >99% true positive rate at <1% false positive rate.
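As a concrete yardstick, you can read the true positive rate at a 1% false positive rate straight off the ROC curve. Here's a quick sketch with scikit-learn (summarize_performance may already report something along these lines):

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred)
idx = np.searchsorted(fpr, 0.01, side='right') - 1    # last point on the curve with FPR <= 1%
print('TPR @ 1% FPR: {:.3f}'.format(tpr[idx]))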
Seems like we can do one of two things here:
1. go back and spend more time on feature engineering, leaning on even more domain knowledge to squeeze more out of these hand-crafted features, or
2. throw out the hand-crafted features and learn a model directly from the raw bytes, end-to-end.
Hey, end-to-end deep learning disrupted object detection, image recognition, speech recognition and machine translation. And that sounds way more interesting than item 1, so let's pull out some end-to-end deep learning for static malware detection!